Public transport analysis¶

Data Science project by Elina Yancheva

DataFrame Columns:
 - date
 - media_id
 - media_type
 - product_id
 - pan
 - tap_lat
 - tap_lon
 - is_network_og
 - transport_type
 - og_line_id
 - origin_ts
 - origin_stop_id
 - origin_stop_lat
 - origin_stop_lon
 - task_id
 - boarding_dist
 - boarding_t
 - is_transfer
 - transfer_t
 - transfer_dist
 - dest_ts
 - dest_stop_id
 - dest_stop_lat
 - dest_stop_lon
 - id

Data cleaning and preparation¶

Invalid entries search¶

Dataset shape: (55978, 25)

Invalid data counts by specific validation rules:
transfer_t: 236 invalid values (0.42%)

Total cells in dataset: 1399450
Total invalid cells: 236
Percentage of invalid data: 0.02%

Rows with origin timestamp >= destination timestamp: 0 (0.00%)
Duplicate IDs: 23888 (42.67%)

Negative values search¶

Check for negative values in numeric columns where negative values would be problematic

Negative Value Checks:

Extreme negative transfer times (< -10 min): 24 records

Invalid transfer times by transport type:
 transport_type
metro         138
bus            70
trolleybus     22
tram           12
Name: count, dtype: int64

Transfers with zero transfer time: 6 records

Duplicate IDs search¶

Number of unique IDs that appear multiple times: 13229
Total number of rows with duplicate IDs: 23885

Analyzing duplicate ID rows:

Out of 13229 IDs with duplicates:
- 0 IDs have completely identical rows
- 13229 IDs have differences between rows
The dataset contains a single date: 2024-03-01

All duplicate IDs (13,230 unique IDs appearing 23,888 times) represent different transport journeys rather than identical data entries. Since these represent legitimate separate journeys might have been made by the same user/card, we will retain all rows in our analysis without filtering for unique IDs.

Total rows with duplicate IDs: 37114
Percentage of duplicate ID rows where is_transfer = True: 46.61%
Percentage of duplicate ID rows where is_transfer = False: 53.39%

Analyzing transfer patterns by journey sequence:

Most common transfer patterns (T=transfer, N=non-transfer):
TN: 4500 IDs (34.02%)
NN: 1641 IDs (12.40%)
TNT: 965 IDs (7.29%)
NT: 825 IDs (6.24%)
TNN: 564 IDs (4.26%)
TT: 521 IDs (3.94%)
NNN: 514 IDs (3.89%)
NTN: 369 IDs (2.79%)
TNTN: 358 IDs (2.71%)
TTN: 308 IDs (2.33%)

Transfer Pattern Analysis for Duplicate IDs¶

Check if all duplicate ids involve transfers, to check if a single journey with many segments is represented by multiple rows or if they are different journeys with the same ID.

Analysis of the 37,118 journeys with duplicate IDs reveals that 46.61% involve transfers, with the most common pattern (34.01%) being a transfer followed by a non-transfer journey (TN). This suggests that many users make connected trips where they transfer once and then complete their journey directly, while the presence of various multi-segment patterns (TNT, TNN, TNTN) indicates more complex travel behaviors spanning multiple transit modes. The following analysis assumes that all duplicate IDs represent different journeys, by the same person.

Missing values¶

media_id: 23.64% missing
product_id: 79.65% missing
pan: 76.36% missing
og_line_id: 67.44% missing
task_id: 67.44% missing
boarding_dist: 67.44% missing
transport_type
metro    37751
Name: count, dtype: int64
---
All metro records don't have og_line_id:        True

Metro has the most missing data across all fields, with complete absence of product_id, og_line_id, task_id, and boarding_dist values. This suggests metro journeys are tracked differently in the system.

Value counts for dest_datetime (date only):
dest_datetime
2025-03-01    55973
2025-03-02        2
Name: count, dtype: int64

Value counts for date:
date
2024-03-01    55975
Name: count, dtype: int64

Date Inconsistency Analysis¶

While all records show the journey date as 2024-03-01, the destination timestamps overwhelmingly resolve to 2025-03-01 with two outliers on 2025-03-02 (maybe short after midnight). This one-year difference between date (journey dates) and destination timestamps indicates a systematic timestamp error.

Exploratory data analysis¶

Transport type distribution¶

Transport type distribution by line number¶

Most Frequent Line IDs by Transport Type
 Summary of the day

Bus:
- Most frequent line: 280 with 676 trips (7.4% of all bus trips)
- Total number of unique lines: 78
- Total trips recorded: 9133

Tram:
- Most frequent line: 5 with 761 trips (13.7% of all tram trips)
- Total number of unique lines: 16
- Total trips recorded: 5542

Trolleybus:
- Most frequent line: 73 with 602 trips (17.0% of all trolleybus trips)
- Total number of unique lines: 16
- Total trips recorded: 3549

Trip timing analysis¶

Transport Usage Analysis Summary¶

Hourly Usage Patterns:

  • Clear morning peak (8-12) for all transport modes
  • Metro volumes significantly exceed all other transportation types

Transfer Analysis¶

Average transfer waiting time: 7.45 minutes
Average transfer distance: 0.06 kilometers

Transfer Analysis Insights¶

Transfer Frequency¶

  • Bus has the highest transfer rate at ~44%, significantly above the overall average of 36.2%.
  • Metro has the lowest transfer rate at ~34%, suggesting it may serve more direct routes or complete journeys.
  • Tram and Trolleybus show transfer rates close to or slightly below the overall average.

Transfer Time¶

  • Transfer times typically range from ~150-750 seconds (2.5-12.5 minutes) across all transport types.
  • Tram shows the highest median transfer time, suggesting potentially less frequent service.
  • Metro displays notable outliers with some negative transfer times, indicating potential data quality issues or system timing anomalies.

Transfer Distance¶

  • Trolleybus transfers involve the longest distances (median ~100m).
  • Metro shows a weird distribution with many transfers happening at very short distances (0-1m) but also having numerous outliers stretching to 500m.
  • Bus and Tram show similar distance distributions with medians around 75-100m.

These patterns suggest that while metro is the dominant transport mode, bus connections play a crucial role in the overall network connectivity. The presence of negative transfer times and the unusual metro distance distribution suggest further investigation into data quality and how transfers are recorded in the system.

Geospatial analysis¶

Make this Notebook Trusted to load map: File -> Trust Notebook
Make this Notebook Trusted to load map: File -> Trust Notebook
Make this Notebook Trusted to load map: File -> Trust Notebook

Commuters vs occasional users¶

User identifiers¶

Records by media type:
media_type
desfire    41499
EMV        13231
ulc         1245
Name: count, dtype: int64
Distribution by Media Type and ID Type:
With PAN With Media ID Total Records
EMV 13231 0 13231
desfire 0 41499 41499
ulc 0 1245 1245

Summary of user identifiers:

  • EMV cards use PAN (payment card numbers)
  • Desfire and ULC cards use Media ID
  • No card type uses both identifiers
  • all cards have unique identifiers
One-time: Users who traveled only once.
Regular: Users who traveled 2-3 times.
Frequent: Users who traveled 4-9 times.
Very Frequent: Users who traveled more than 10 times.

User segments based on journey count:
One-time: 18861 users (58.8%)
Regular: 10618 users (33.1%)
Frequent: 2575 users (8.0%)
Very Frequent: 36 users (0.1%)

Commuter ratio: 2.00% of users are commuters
EMV card usage: 23.6%
Other card types: 76.4%

Weekend Public Transport Usage Patterns¶

The Saturday data reveals predominantly one-time travelers (58.8%) and regular users making 2-3 trips (33.1%), with significantly fewer frequent users (8.0%) and very few very frequent users (10+ trips) at just 0.1%.

This distribution is typical for weekend travel, with the low commuter ratio (2.34%) reflecting weekend-specific behavior. Most weekend passengers are making targeted leisure trips, shopping excursions, or social visits rather than the repetitive commuting patterns seen on weekdays.

Just 23.6% of travelers used EMV cards (typically occasional users paying standard fares), while the vast majority (76.4%) used specialized transit cards, suggesting most weekend travelers are still regular or subscription-holding passengers despite the reduced commuter pattern.

Clustering analysis¶

Applying clustering techniques to identify distinct patterns in public transport usage that might not be apparent through simple statistical analysis. Clustering helps discover natural groupings in the data based on multiple features simultaneously (trip duration, distance, speed, transport type, transfers, and time of day).

Standardize the features¶

For example, without scaling, a 5-minute difference in trip duration (small relative to the ~24 minute range) would outweigh a change from no transfer to transfer (which is the maximum possible change in that feature).

After standardization, all features are expressed in the same unit: standard deviations from the mean. This makes them directly comparable and ensures no feature dominates the distance calculations in algorithms like K-means simply because it has larger values.

K-means Clustering¶

  • works well with the numerical features in this dataset (distance, duration, speed)
  • It creates a fixed number of clusters with similar sizes, which helps identify major trip categories
  • It's computationally efficient for large datasets

Determining the Optimal Number of Clusters¶

Cluster Centers:
direct_distance_km trip_duration_min avg_speed_kmh transport_type_code is_transfer hour_of_day
0 7.288520 22.691186 20.226426 0.902816 2.034468e-01 9.820177
1 2.647106 10.021272 16.666801 0.824825 2.986500e-14 10.672575
2 3.118071 11.107689 17.348273 0.801027 1.000000e+00 11.252833
3 2.014908 9.197925 13.991220 2.684802 3.572330e-01 10.990406
  1. transport_type_code: This column was originally encoded as 0 (bus), 1 (metro), and 2 (trolleybus), 3 (tram). After clustering, we're seeing decimal values like 0.902816, which indicates that Cluster 0 contains a mix of transport types, leaning slightly toward metro (1). Cluster 3 with 2.684802 suggests it contains predominantly trolleybus and tram trips.

These "strange" values occur because cluster centers represent the average of all points in a cluster.

Cluster intrpretation¶

  • Cluster 0: Long-distance trips (7.3 km), longest duration (22.7 min), fastest speed (20.2 km/h), mix of transport types but mostly metro, 20% transfers, early morning (9.8 hour = 9:48 AM)

  • Cluster 1: Medium-distance trips (2.6 km), medium duration (10 min), medium speed (16.7 km/h), mostly metro, virtually no transfers, mid-morning (10.7 hour)

  • Cluster 2: Medium-distance trips (3.1 km), medium duration (11.1 min), medium-fast speed (17.3 km/h), mostly metro, 100% transfers, late morning (11.3 hour)

  • Cluster 3: Short-distance trips (2 km), shortest duration (9.2 min), slowest speed (14 km/h), predominantly trolley and tram, 36% transfers, mid-morning (11 hour)

DBSCAN Clustering¶

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) because:

  • It can discover clusters of arbitrary shape, which is valuable for temporal patterns in transport data
  • It doesn't require pre-specifying the number of clusters
  • It effectively identifies outliers as "noise" points, which helps isolate unusual travel patterns
Testing 30 parameter combinations on 5000 sample points...
Top DBSCAN Parameter Combinations:
eps min_samples n_clusters n_noise noise_ratio silhouette
17 0.5 50 2 4878 0.9756 0.367025
28 1.0 100 2 2086 0.4172 0.274956
22 0.7 50 4 2608 0.5216 0.256140
10 0.3 10 24 4250 0.8500 0.170623
21 0.7 20 7 1151 0.2302 0.127688
26 1.0 20 8 375 0.0750 0.126463
27 1.0 50 7 945 0.1890 0.123067
20 0.7 10 10 597 0.1194 0.122569
16 0.5 20 6 2892 0.5784 0.121187
25 1.0 10 8 194 0.0388 0.119380
Applying optimal parameters to full dataset...
Number of clusters in full dataset: 10
Noise points in full dataset: 1063 (1.90%)

Cluster sizes:
dbscan_cluster
-1     1063
 0     3692
 1    24943
 2     2173
 3     1154
 4     4851
 5    12672
 6     2059
 7     3340
 8       28
Name: count, dtype: int64
Cluster statistics:
direct_distance_km trip_duration_min avg_speed_kmh is_transfer hour_of_day
mean std mean std mean std mean mean
dbscan_cluster
-1 7.347782 4.404736 24.439284 15.892683 22.922333 10.512042 0.608655 11.669802
0 2.886126 2.160586 11.184499 7.178226 15.783792 5.290962 1.000000 11.146804
1 4.134472 2.557188 13.594033 6.863517 18.189080 6.407091 0.000000 10.417151
2 2.504481 1.612838 11.276460 7.432770 14.115128 4.351840 0.000000 10.803037
3 2.149410 1.382308 9.197708 5.843923 14.585703 4.155787 1.000000 11.487868
4 3.140819 2.313891 12.954868 9.530624 15.420983 5.587265 0.000000 10.179551
5 3.868322 2.132192 12.968212 5.935597 18.039500 6.473356 1.000000 11.038668
6 1.932517 1.297943 8.611904 5.866159 14.361062 5.226110 1.000000 11.338514
7 2.230385 1.494415 10.111399 7.020781 14.122665 5.002640 0.000000 10.597006
8 8.551643 0.466971 45.642064 2.667642 11.287417 1.000630 1.000000 12.571429

Conclusion¶

This analysis of Sofia's public transport system reveals distinct usage patterns characterized by four primary journey profiles, with metro being the dominant transit mode and showing clear morning and mid-day peak periods, due to the weekend. The clustering methods effectively identified trip segments ranging from long-distance, high-speed metro journeys to shorter, slower surface transit trips, providing valuable insights for future service planning and optimization.